The First Tour of the IPython Notebook

Why IPython Notebook?

I still remember the first time I saw IPython Notebook: it was at Taipei.py. At the time, I wasn't sure why it would be a good idea to write Python programs in a restricted environment inside a browser. I mean, I couldn't even use my favorite Vim commands! I tried installing it and wrote a few short programs, but in the end I wasn't interested enough to keep using the Notebook. On the other hand, I found the IPython interactive shell more convenient than the plain Python shell, so I used it more and more often whenever I needed to issue short Python commands.

My second encounter with IPython Notebook came through the course materials of CS231n. In that course, each assignment was an IPython Notebook. You could complete the code either in the Notebook itself or in separate Python scripts, and the results could be evaluated immediately in the Notebook. I realized this was a fantastic way to share and communicate! There was also nbviewer, which made it easy to view all those Notebooks without having to set up a Python environment.

As I gained more experience with machine learning tasks in Python, I started to understand why IPython Notebook was so popular in the scientific computing community. I believe one of the reasons is that it provides a very simple way to record everything you do.

For example, when doing data science you need to manage not just the source code but also the data. Tasks such as data preprocessing, data cleaning, and feature extraction all require transformations of the data. Oftentimes a transformation only needs to be done once, and it is very tempting to just issue the command without recording what was done. This can become a disaster when you later want to rerun the experiments with different settings for the early stages of the pipeline. Even if you do put the commands into Python scripts, it is still difficult to figure out the order and parameters for those scripts later. On the other hand, if you write a single script that performs every data transformation and analysis straight from the original data on every run, the running time may become unacceptable when dealing with big data. That's where IPython Notebook shines: it is the perfect notebook to record everything.

Data Analysis with IPython Notebook

So let's get started with our tour of IPython Notebook. Some simple tasks will be demonstrated using several libraries including:

  1. mpld3
  2. matplotlib
  3. NumPy
  4. pandas
  5. scikit-learn
  6. wordcloud
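
If these libraries are not installed yet, they can usually be pulled in from PyPI; here is a minimal sketch (the package names are as published on PyPI, and you may prefer to do this inside a virtualenv):

pip3 install mpld3 matplotlib numpy pandas scikit-learn wordcloud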

In particular, matplotlib is a powerful plotting package for data visualization, and its close integration with IPython Notebook makes it even more useful.

First, we use the %matplotlib inline magic command so that matplotlib displays plots directly inside the Notebook.


In [1]:
%matplotlib inline

Showing the Most Important Features

To build intuition about a model, finding the features with the largest weights is often helpful. We will use the polarity dataset for the demonstration:


In [2]:
! wget http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz
! tar xzf review_polarity.tar.gz


--2015-12-26 16:14:15--  http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz
Resolving www.cs.cornell.edu (www.cs.cornell.edu)... 128.84.154.137
Connecting to www.cs.cornell.edu (www.cs.cornell.edu)|128.84.154.137|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3127238 (3.0M) [application/x-gzip]
Saving to: ‘review_polarity.tar.gz’

100%[======================================>] 3,127,238    655KB/s   in 5.6s   

2015-12-26 16:14:21 (543 KB/s) - ‘review_polarity.tar.gz’ saved [3127238/3127238]

First, load the required modules:


In [3]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

We use TfidfVectorizer to get the TF-IDF feature vector for each review:


In [4]:
sent_data = load_files('txt_sentoken')

tfidf_vec = TfidfVectorizer()

sent_X = tfidf_vec.fit_transform(sent_data.data)
sent_y = sent_data.target

LinearSVC is used to train a classifier for positive and negative sentiments.


In [5]:
lsvc = LinearSVC()
lsvc.fit(sent_X, sent_y)


Out[5]:
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)
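
Before looking at the weights, it can be reassuring to check how well the classifier actually performs. The quick cross-validation check below is not part of the original notebook; it is only a sketch, and the import path assumes the same scikit-learn generation used elsewhere in this post (newer releases expose cross_val_score under sklearn.model_selection instead):

from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in newer releases

# 5-fold cross-validated accuracy on the same TF-IDF features
scores = cross_val_score(LinearSVC(), sent_X, sent_y, cv=5)
print('mean accuracy: {:.3f}'.format(scores.mean()))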

Finally, we show the most important features learned by the classifier.


In [6]:
def display_top_features(weights, names, top_n):
    # sort the features by the absolute value of their weights and keep the top_n
    top_features = sorted(zip(weights, names), key=lambda x: abs(x[0]), reverse=True)[:top_n]
    top_weights = [x[0] for x in top_features]
    top_names = [x[1] for x in top_features]
    
    fig, ax = plt.subplots(figsize=(16,8))
    ind = np.arange(top_n)
    bars = ax.bar(ind, top_weights, color='blue', edgecolor='black')
    # negatively weighted features are shown in red
    for bar, w in zip(bars, top_weights):
        if w < 0:
            bar.set_facecolor('red')
    
    # place the tick labels roughly under the center of each bar
    width = 0.30
    ax.set_xticks(ind + width)
    ax.set_xticklabels(top_names, rotation=45, fontsize=12)
    
    plt.show(fig)

display_top_features(lsvc.coef_[0], tfidf_vec.get_feature_names(), 20)


Word clouds are also an interesting way to show the relative importance of different words:


In [7]:
from wordcloud import WordCloud

In [8]:
def generate_word_cloud(weights, names):
    return WordCloud(width=350, height=250).generate_from_frequencies(zip(names, weights))

def display_word_cloud(weights, names):
    fig, ax = plt.subplots(1, 2, figsize=(28, 10))
    
    pos_weights = weights[weights > 0]
    pos_names = np.array(names)[weights > 0]
    
    neg_weights = np.abs(weights[weights < 0])
    neg_names = np.array(names)[weights < 0]
    
    lst = [('Positive', pos_weights, pos_names), ('Negative', neg_weights, neg_names)]
    
    for i, (label, weights, names) in enumerate(lst):
        wc = generate_word_cloud(weights, names)
        ax[i].imshow(wc)
        ax[i].set_axis_off()
        ax[i].set_title('{} words'.format(label), fontsize=24)
    
    plt.show(fig)

display_word_cloud(lsvc.coef_[0], tfidf_vec.get_feature_names())


Visualization with Dimensionality Reduction

It's often difficult to make sense of high-dimensional data, so dimensionality reduction is commonly used to help with visualization. Here we will use t-SNE on the Iris flower data set. Additionally, we use mpld3 to produce figures that can be zoomed and panned interactively.


In [9]:
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE
import mpld3

In [10]:
iris = load_iris()

In [11]:
def display_iris(data):
    X_tsne = TSNE(n_components=2, perplexity=20, learning_rate=50).fit_transform(data.data)
    
    fig, ax = plt.subplots(1, 2, figsize=(10, 5))
    ax[0].scatter(X_tsne[:, 0], X_tsne[:, 1])
    ax[0].set_title('All instances', fontsize=14)
    ax[1].scatter(X_tsne[:, 0], X_tsne[:, 1], c=data.target)
    ax[1].set_title('All instances labeled with color', fontsize=14)
    
    return mpld3.display(fig)

display_iris(iris)


Out[11]:

As we can see, t-SNE does quite well at separating data points of different classes even without knowing the labels. Let's try a more complicated example with the MNIST dataset of handwritten digits. We will also use PointLabelTooltip to display the labels as tooltips.


In [12]:
from sklearn.datasets import fetch_mldata
from sklearn.decomposition import PCA

In [13]:
mnist = fetch_mldata('MNIST original')

In [14]:
def display_mnist(data, n_samples):
    X, y = data.data / 255.0, data.target
    
    # downsample as the scikit-learn implementation of t-SNE is unable to handle too much data
    indices = np.arange(X.shape[0])
    np.random.shuffle(indices)
    X_train, y_train = X[indices[:n_samples]], y[indices[:n_samples]]
    
    
    X_tsne = TSNE(n_components=2, perplexity=30).fit_transform(X_train)
    X_pca = PCA(n_components=2).fit_transform(X_train)
    
    fig, ax = plt.subplots(1, 2, figsize=(12, 6))


    points = ax[0].scatter(X_tsne[:,0], X_tsne[:,1], c=y_train)
    tooltip = mpld3.plugins.PointLabelTooltip(points, labels=y_train.tolist())
    mpld3.plugins.connect(fig, tooltip)
    ax[0].set_title('t-SNE')
    
    points = ax[1].scatter(X_pca[:,0], X_pca[:,1], c=y_train)
    tooltip = mpld3.plugins.PointLabelTooltip(points, labels=y_train.tolist())
    mpld3.plugins.connect(fig, tooltip)
    ax[1].set_title('PCA')
    
    
    return mpld3.display(fig)

display_mnist(mnist, 1000)


Out[14]:

If your aim is to learn a projection when labels are available for the training data, LDA can also be used.


In [15]:
from mpl_toolkits.mplot3d import Axes3D
from sklearn.lda import LDA

In [16]:
def display_mnist_3d(data, n_samples):
    X, y = data.data / 255.0, data.target
    
    # downsample to keep the example fast and the 3D scatter plot responsive
    indices = np.arange(X.shape[0])
    np.random.shuffle(indices)
    X_train, y_train = X[indices[:n_samples]], y[indices[:n_samples]]
    
    X_lda = LDA(n_components=3).fit_transform(X_train, y_train)
    
    
    fig, ax = plt.subplots(figsize=(10,10), subplot_kw={'projection':'3d'})
    
    points = ax.scatter(X_lda[:,0], X_lda[:,1], X_lda[:,2] , c=y_train)
    ax.set_title('LDA')
    ax.set_xlim((-6, 6))
    ax.set_ylim((-6, 6))
    
    
    plt.show(fig)
    
display_mnist_3d(mnist, 1000)


Data exploration with Pandas

Pandas is quite useful for data analysis. Let's use the Meta Kaggle dataset to see how users are doing on the Kaggle website.


In [17]:
import pandas as pd
import sqlite3

After manually downloading the dataset, we extract the zipped file. There should be an output directory containing the files.
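
For reference, the extraction can be done from a shell like this (the archive name below is just a guess; use whatever filename Kaggle actually gives you):

unzip meta-kaggle.zip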


In [18]:
con = sqlite3.connect('output/database.sqlite')
kaggle_df = pd.read_sql_query('''
SELECT * FROM Submissions''', con)

Display some entries:


In [19]:
kaggle_df.head()


Out[19]:
     Id  SubmittedUserId        DateSubmitted  TeamId  PrivateScore  PublicScore  IsSelected  ScoreStatus  IsAfterDeadline  DateScored  ScoringDurationMilliseconds
0  2180              647  2010-04-29 22:32:08     496       56.2139      55.7692       False            1            False
1  2181              619  2010-04-30 09:38:29     497            50      47.1154       False            1            False
2  2182              619  2010-04-30 09:48:50     497       65.6069      61.0577       False            1            False
3  2184              663  2010-05-01 11:02:52     499            50      47.1154       False            1            False
4  2185              673  2010-05-02 08:04:38     500       62.2832      61.0577       False            1            False

Now, we would like to analyze the submission times. First, we obtain the day of the week and the hour of the week for each submission.


In [20]:
print('There are {} submissions'.format(kaggle_df.shape[0]))

# convert time strings to DatetimeIndex
kaggle_df['timestamp'] = pd.to_datetime(kaggle_df['DateSubmitted'])

print('The earliest and latest submissions are on {} and {}'.format(kaggle_df['timestamp'].min(), kaggle_df['timestamp'].max()))

kaggle_df['weekday'] = kaggle_df['timestamp'].dt.weekday
kaggle_df['weekhr'] = kaggle_df['weekday'] * 24 + kaggle_df['timestamp'].dt.hour


There are 934345 submissions
The earliest and latest submissions are on 2010-04-29 22:32:08 and 2015-08-31 23:58:44.050000

In [21]:
import calendar

In [22]:
def display_kaggle(df):
    fig, ax = plt.subplots(1, 2, figsize=(16, 8))
    
    # bar chart: number of submissions per day of the week (index relabeled to day names)
    ax[0].set_title('submissions per weekday')
    df['weekday'].value_counts().sort_index().rename_axis(lambda x: calendar.day_name[x]).plot.bar(ax=ax[0])
    
    # line chart: number of submissions per hour of the week (0-167)
    ax[1].set_title('submissions per hour of week')
    ax[1].set_xticks(np.linspace(0, 24*7, 8))
    df['weekhr'].value_counts().sort_index().plot(color='red', ax=ax[1])
    plt.show(fig)
    
display_kaggle(kaggle_df)


Next, we try to cluster the users based on their submission patterns to see whether different groups might like to submit at different times.


In [23]:
from collections import defaultdict
from sklearn.cluster import KMeans

In [24]:
def display_hr(df, n_clusters):
    # count each user's submissions per hour of the week,
    # then normalize so that every user's counts sum to one
    hrs_per_user = df[['SubmittedUserId', 'weekhr', 'Id']].groupby(['SubmittedUserId', 'weekhr']).count()
    total_per_user = hrs_per_user.sum(axis=0, level=0)
    user_patterns = (hrs_per_user / total_per_user)['Id']
    
    # build a dense vector of length 24*7 = 168 for each user
    vectors = defaultdict(lambda: np.zeros(24*7))
    for (u, hr), r in user_patterns.items():
        vectors[u][hr] = r
    X_hr = np.array(list(vectors.values()))
    
    # cluster the submission patterns and plot the mean pattern of each cluster
    y = KMeans(n_clusters=n_clusters, random_state=3).fit_predict(X_hr)
    
    for i in range(n_clusters):
        fig, ax = plt.subplots(figsize=(6, 6))
        indices = y == i
        X = X_hr[indices]
        ax.plot(np.arange(24*7), X.mean(axis=0))
        ax.set_xticks(np.linspace(0, 24*7, 8))
        ax.set_xlim((0, 24*7))
        ax.set_title('Cluster #{}, n = {}'.format(i, len(X)), fontsize=14)
        plt.show(fig)
        
display_hr(kaggle_df, 9)


It seems that the users from Cluster #1 and Cluster #8 might indeed be active at different times. What do you think?

XKCD

Finally, let's draw an XKCD-style plot with matplotlib! To make it render properly, we need to install the Humor Sans font and clear the font cache directory. To get the path of the cache, use:

import matplotlib
matplotlib.get_cachedir()
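
A minimal sketch of clearing the cache from a shell, assuming the path reported by get_cachedir() is the one you want to delete:

rm -rf $(python3 -c "import matplotlib; print(matplotlib.get_cachedir())")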

As we use Python 3, additional packages are also required:

sudo apt-get install libffi-dev
pip3 install cairocffi

In [25]:
def xkcd():
    with plt.xkcd():
        fig, ax = plt.subplots()
        ax.spines['right'].set_color('none')
        ax.spines['top'].set_color('none')
        ax.set_xticks([])
        ax.set_yticks([])
        ax.set_ylim([-1, 10])
        
        # productivity curve: slow ramp-up, frantic spike before the deadline, crash afterwards
        data = np.zeros(100)
        data[:60] += np.linspace(-1, 0, 60)
        data[60:75] += np.arange(15)
        data[75:] -= np.ones(25)
        
        ax.annotate(
            'DEADLINE',
            xy=(71, 7), arrowprops=dict(arrowstyle='->'), xytext=(30, 2))
        
        ax.plot(data)
        # vertical red line marking the deadline
        ax.plot([72, 72], [-1, 15], color='red')
        
        ax.set_xlabel('time')
        ax.set_ylabel('productivity')
        ax.set_title('productivity under a deadline')
        
        plt.show(fig)
xkcd()

